Introduction

I searched Pixabay for the term "fighter jets". I have always been fascinated by aircraft, especially military aircraft, so I took the opportunity to look at some cool pictures of planes.

I noticed some features that seemed to generally apply to the photos:

photo_data %>% 
  select(`Picture Links` = pageURL) %>%
  slice(1:5) %>%
  knitr::kable()
Picture Links
https://pixabay.com/photos/jet-fighter-jet-raaf-hornets-2974131/
https://pixabay.com/photos/fighter-jet-f-15-strike-eagle-63090/
https://pixabay.com/photos/jet-ships-sea-flying-ocean-6234694/
https://pixabay.com/photos/us-navy-military-aviation-aircraft-8008429/
https://pixabay.com/photos/clouds-dramatic-clouds-fighter-jet-251293/

Key Features of Selected Photos

What was the range of view counts for these photos?

# Range of view counts
view_range <- photo_data %>% 
  pull(views) %>% 
  range()

Before analysing the data we observed that the photos had a wide variety of view counts. Upon inspection we find that, among our selected photos, the least-viewed photo had 340 views and the most-viewed had 103,783 views.

How many different users are associated with these photos?

# Number of users
num_users <- photo_data %>% 
  pull(user) %>% 
  unique() %>% 
  length()

Each photo on Pixabay has a user associated with it, and our selection of photos is associated with 30 different users. Here are the top 5!

photo_data %>%
  group_by(user) %>%
  summarise(count = n()) %>%
  arrange(desc(count)) %>%
  slice(1:5) %>%
  knitr::kable()
user count
onkelglocke 7
Military_Material 5
12019 4
Netloop 4
SimoneVomFeld 4

How likely is a photo to be downloaded after it is viewed?

# Middle 75% of download view percentage
mid_75_down_view_percentage <- photo_data %>% 
  pull(download_view_proportion) %>% 
  quantile(c(0.125, 0.875)) * 100

From my investigation I found that for the middle 75% of the photos (between the 12.5% and 87.5% quantiles), the percentage of viewers who went on to download a photo was between 45.3% and 75.8%.
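As a small illustration with made-up proportions (not the real data), the 12.5% and 87.5% quantiles bound the middle 75% of a set of values:

```r
# Toy vector of download/view proportions (hypothetical values)
toy_props <- c(0.40, 0.45, 0.50, 0.55, 0.60, 0.65, 0.70, 0.75)

# The middle 75% of the toy values fall between these two percentages
mid_75 <- quantile(toy_props, c(0.125, 0.875)) * 100
```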

Learning Reflection

I really enjoyed this module and learned about a lot of important ideas. Firstly, I really liked learning about APIs and using them to obtain data; APIs make it easy to obtain relevant, current data for analysis. We also learned about JSON data and how to work with it. Finally, I developed more skill in manipulating data frames to extract the information I want from them.

I am really interested to learn more about different data sources and expand my knowledge of APIs. This would allow me to easily gather data to answer questions that I might have.

Appendix

# Load packages
library(tidyverse)
library(jsonlite)
library(magick)
library(ggwordcloud)

# Load data
json_data <- fromJSON("pixabay_data.json")
pixabay_photo_data <- json_data$hits
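The JSON file above was saved ahead of time. As a sketch of how it could have been fetched, the request URL for a search like mine can be built as follows (this assumes the standard Pixabay API endpoint; PIXABAY_API_KEY is a placeholder, not a real key):

```r
# Build a Pixabay API request URL for the search term "fighter jets"
# PIXABAY_API_KEY is a placeholder; a real key comes from a Pixabay account
base_url <- "https://pixabay.com/api/"
params <- list(key = "PIXABAY_API_KEY", q = "fighter jets", image_type = "photo")

request_url <- paste0(
  base_url, "?",
  paste(names(params),
        vapply(params, utils::URLencode, character(1), reserved = TRUE),
        sep = "=", collapse = "&")
)

# fromJSON(request_url)$hits would then return the photo data frame
```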

# Filter and select some of the photos
# Get list of column names
names(pixabay_photo_data)

# Keep only some variables, create new variables, and filter images to reduce number
# download_view_proportion is the proportion of people who downloaded an image after viewing it
# popularity ranks images logarithmically into 4 categories according to views
# A popular user is onkelglocke so is_onkelglocke checks if an image's user is onkelglocke
# We filter first by imageWidth and imageHeight to keep only photos of better than 4k resolution
# Since there were still too many photos, we kept only the half of the images with imageSize less than the median
selected_photos <- pixabay_photo_data %>% 
  select(previewURL, pageURL, tags, imageWidth, imageHeight, imageSize, views, downloads, collections, likes, comments, user) %>% 
  mutate(download_view_proportion = downloads / views,
         popularity = case_when(views >= 1e5 ~ "Very Popular",
                                views >= 1e4 ~ "Popular",
                                views >= 1e3 ~ "Average",
                                TRUE ~ "Unpopular"),
         is_onkelglocke = (user == "onkelglocke")) %>% 
  filter(imageWidth >= 3840 &
           imageHeight >= 2160) %>% 
  filter(imageSize < median(imageSize))

# Save new dataframe as csv file
write_csv(selected_photos, "selected_photos.csv")

# List most popular users for selected photos
selected_photos %>% 
  group_by(user) %>% 
  summarise(count = n()) %>% 
  arrange(desc(count))

# Data exploration and summary values
glimpse(selected_photos)

# Some exploratory plots
selected_photos %>% 
  ggplot() +
  geom_point(aes(x = download_view_proportion, y = likes), 
             colour = "#50c4f2")

selected_photos %>% 
  ggplot() +
  geom_point(aes(x = is_onkelglocke, y = download_view_proportion), 
             colour = "#50c4f2")

selected_photos %>% 
  ggplot() +
  geom_point(aes(x = popularity, y = download_view_proportion), 
             colour = "#50c4f2")

selected_photos %>% 
  ggplot() +
  geom_point(aes(x = collections, y = log(views)), 
             colour = "#50c4f2")

selected_photos %>% 
  ggplot() +
  geom_point(aes(x = views, y = likes), 
             colour = "#50c4f2")


# mean likes score (adjusted for views) by popularity
# Do higher popularity photos (logarithmically according to views) get more likes once we account for the effect of views?
selected_photos %>% 
  mutate(like_adj = likes / views * 1e4) %>% 
  group_by(popularity) %>% 
  summarise(mean_likes = mean(like_adj))
# No, there does not seem to be such an effect
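As a quick toy check (made-up numbers) of what like_adj measures: it rescales likes to "likes per 10,000 views", so photos with very different view counts become comparable.

```r
# Two toy photos: 50 likes on 10,000 views vs 200 likes on 100,000 views
toy_likes <- c(50, 200)
toy_views <- c(1e4, 1e5)

like_adj <- toy_likes / toy_views * 1e4
# the first photo earns more likes per 10,000 views despite fewer total likes
```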

# Range of views
view_range <- selected_photos %>% 
  pull(views) %>% 
  range()

# Number of different users
num_users <- selected_photos %>% 
  pull(user) %>% 
  unique() %>% 
  length()

# Number of photos by onkelglocke
selected_photos %>% 
  pull(is_onkelglocke) %>% 
  sum()

# 75% of the selected photos have a percentage of downloads per views between these values
mid_75_down_view_percentage <- selected_photos %>% 
  pull(download_view_proportion) %>% 
  quantile(c(0.125, 0.875)) %>% 
  {. * 100} %>%   # braces needed: `%>%` binds tighter than `*`, so `* 100 %>% round()` would never round the result
  round(1)

# Create animated GIF
photos <- selected_photos %>% 
  pull(previewURL) %>% 
  image_read() %>%
  image_resize(geometry = "250x250")

animated_photos <- image_animate(photos, fps = 1)

animated_photos %>% 
  image_write(path = "my_photos.gif", format = "gif")

# Tags Analysis
# Separate tags into different rows
tags <- selected_photos %>% 
  separate_rows(tags, sep = ", ")
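On a toy two-row example (hypothetical tag strings), separate_rows() fans each comma-separated tags field out into one row per tag:

```r
library(tidyr)
library(dplyr)

# Two toy photos with comma-separated tag strings
toy <- tibble(pageURL = c("photo1", "photo2"),
              tags = c("jet, sky", "jet, clouds, sea"))

# One row per (photo, tag) pair: five rows in total
toy_long <- separate_rows(toy, tags, sep = ", ")
```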

# Arrange tags according to how frequently they are used
popular_tags <- tags %>% 
  group_by(tags) %>% 
  summarise(cnt = n()) %>% 
  arrange(desc(cnt)) %>% 
  pull(tags)

# Five most popular tags
popular_tags[1:5]

# How many unique tags
length(popular_tags)

# Bar chart of the five most popular tags
tags %>% 
  filter(tags %in% popular_tags[1:5]) %>% 
  ggplot() +
  geom_bar(aes(x = tags), fill = "#50c4f2")

# Interesting table
tags_table <- tags %>% 
  group_by(tags) %>% 
  summarise(tag_count = n(),
            mean_views = mean(views),
            mean_downloads = mean(downloads),
            mean_likes = mean(likes),
            largest = paste0(round(max(imageSize) / 1e6, 2), "MB")) %>% 
  arrange(desc(tag_count))

# Try word cloud
tags_table %>% 
  ggplot() +
  geom_text_wordcloud(aes(label = tags, size = tag_count, colour = tag_count)) +
  scale_size_area(max_size = 6) +
  theme_minimal()

The code above covers the rest of my data exploration and preparation.